
Conversation

@hjc4869 (Contributor) commented May 8, 2025

This change addresses issue #13241.

It also includes llama-bench support for the new option, to help with performance tuning.
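
For example, the effect can be measured with llama-bench by sweeping the option on and off (the model path below is a placeholder; full benchmark commands are given later in this thread, and note the option was later renamed to --no-op-offload):

./build/bin/llama-bench -m /path/to/model.gguf -ngl 999 -fa 1 -ot 'exps=CPU' --disable-op-offload 1,0 -n 0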

github-actions bot added the testing (Everything test related), examples, and ggml (changes relating to the ggml tensor library for machine learning) labels on May 8, 2025
@Panchovix commented:

Hi there! I tried this PR (applied the changes on top of the latest commit, to use MLA + FA) on DeepSeek V3 0324, but I noticed lower PP performance. I think it doesn't saturate the PCIe link of the main GPU when using the flag (versus without it), which results in lower PP t/s.

The command I ran was:

./llama-server -m '/GGUFs/DeepSeek-V3-0324-UD-Q2_K_XL-merged.gguf' -c 16384 --no-mmap --no-warmup -v -ngl 99 --override-tensor 'blk\.([0-7])\..*_exps\.=CUDA0' --override-tensor 'blk\.([8-9]|1[0-1])\..*_exps\.=CUDA1' --override-tensor 'blk\.(1[2-6])\..*_exps\.=CUDA2' --override-tensor 'blk\.(1[7-9]|2[0-6])\..*_exps\.=CUDA3' -fa --override-tensor 'blk\..*_exps\.=CPU' -mg 0 --ubatch-size 1024 --disable-op-offload

RX/TX usage without the flag (GPU 0 gets saturated) while doing PP: [screenshot]

RX/TX usage with the flag while doing PP: [screenshot]

So PP drops from 66 t/s to 26 t/s.

Without the flag:

prompt eval time =   35950.29 ms /  3218 tokens (   11.17 ms per token,    89.51 tokens per second)
       eval time =   44338.15 ms /   380 tokens (  116.68 ms per token,     8.57 tokens per second)

With the flag:

prompt eval time =  122421.67 ms /  3218 tokens (   38.04 ms per token,    26.29 tokens per second)
       eval time =   49715.68 ms /   440 tokens (  112.99 ms per token,     8.85 tokens per second)

Maybe I'm using an incompatible flag?

@hjc4869 (Contributor, Author) commented May 9, 2025

Using the flag places much more load on the CPU, and whether that's beneficial at all is highly specific to the model offload params and hardware config.

For now I have personally only tested llama 4 and qwen 3, with a relatively performant CPU (7970X) and a single relatively weak GPU (W7900), using a simple exps=CPU -ot config, so YMMV. There is already a huge variance in perf uplift between the two models I tested.

./build/bin/llama-bench -m ~/models/llama4-400b-hybrid-q8_0-q4_0.gguf -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 -ot 'exps=CPU' -mmp 0 --disable-op-offload 1,0 -n 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon PRO W7900 Dual Slot , gfx1100 (0x1100), VMM: no, Wave Size: 32

| model | size | params | backend | ngl | type_k | type_v | fa | ot | mmap | dopo | test | t/s |
| --- | ---: | ---: | --- | --: | --: | --: | -: | --- | ---: | ---: | --: | ---: |
| llama4 17Bx128E (Maverick) Q8_0 | 216.56 GiB | 400.71 B | ROCm,RPC | 999 | q8_0 | q8_0 | 1 | exps=CPU | 0 | 1 | pp512 | 232.02 ± 1.04 |
| llama4 17Bx128E (Maverick) Q8_0 | 216.56 GiB | 400.71 B | ROCm,RPC | 999 | q8_0 | q8_0 | 1 | exps=CPU | 0 | 0 | pp512 | 27.44 ± 0.04 |

./build/bin/llama-bench -m ~/models/qwen3-235b-a22b-q8_0-q4_0-hybrid.gguf -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 -ot 'exps=CPU' -mmp 0 --disable-op-offload 1,0 -n 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon PRO W7900 Dual Slot , gfx1100 (0x1100), VMM: no, Wave Size: 32

| model | size | params | backend | ngl | type_k | type_v | fa | ot | mmap | dopo | test | t/s |
| --- | ---: | ---: | --- | --: | --: | --: | -: | --- | ---: | ---: | --: | ---: |
| qwen3moe 235B.A22B Q8_0 | 127.02 GiB | 235.09 B | ROCm,RPC | 999 | q8_0 | q8_0 | 1 | exps=CPU | 0 | 1 | pp512 | 61.27 ± 0.13 |
| qwen3moe 235B.A22B Q8_0 | 127.02 GiB | 235.09 B | ROCm,RPC | 999 | q8_0 | q8_0 | 1 | exps=CPU | 0 | 0 | pp512 | 43.00 ± 0.09 |

@Panchovix commented:

Ah, that could be the reason then. I have 192 GB of RAM with a Ryzen 7 7800X3D, which is a consumer CPU, so it's pretty weak for these tasks.

@jukofyork (Collaborator) commented May 10, 2025

Could we not just parameterise the fixed batch-size threshold of 32 used to decide when to offload? That would still let you disable offloading by setting it to a very large value, but would also allow more nuanced settings.

@slaren (Member) commented May 10, 2025

A setting to control the minimum batch size would need to be per-backend, and configured via an environment variable.

@jukofyork (Collaborator) commented:

> A setting to control the minimum batch size would need to be per-backend, and configured via an environment variable.

Ah, sorry, I forgot that the 32 limit I was thinking of is CUDA-specific.
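
To make the idea concrete, here is a rough sketch (hypothetical function and environment-variable names only, not the actual ggml code): each backend could read its own minimum batch size from an environment variable and use it in its offload decision, keeping the current CUDA default of 32 when the variable is unset.

// Rough sketch with hypothetical names, not the actual ggml API.
#include <cstdlib>

// Read a per-backend minimum batch size for op offload from an environment
// variable (the name is hypothetical), falling back to the current CUDA
// default of 32.
static int op_offload_min_batch(const char * env_name) {
    const char * val = std::getenv(env_name);
    if (val != nullptr && *val != '\0') {
        return std::atoi(val);
    }
    return 32; // current hard-coded CUDA threshold
}

// Offload an op to the device only when its batch dimension reaches the
// threshold. Setting the variable to a huge value effectively disables
// offloading for that backend, similar to what --no-op-offload does globally.
static bool should_offload_op(int batch_size) {
    static const int min_batch = op_offload_min_batch("GGML_CUDA_OP_OFFLOAD_MIN_BATCH");
    return batch_size >= min_batch;
}

Reading the variable once and caching the result would keep the per-op check cheap.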

@hjc4869 changed the title from "Add --disable-op-offload to improve -ot pp perf in MoE models like llama4 400B" to "Add --no-op-offload to improve -ot pp perf in MoE models like llama4 400B" on May 11, 2025
@slaren merged commit 7f323a5 into ggml-org:master on May 11, 2025 (1 check passed)
@hjc4869 deleted the no_op_offload branch on May 11, 2025 at 13:24
@iSevenDays commented:

FYI: using --no-op-offload on a Dell R740 with two Nvidia 4090D 48G GPUs slows down prompt processing a lot.

With the flag:

prompt eval time =   78715.84 ms /  5480 tokens (   14.36 ms per token,    69.62 tokens per second)
eval time =   36350.14 ms /   195 tokens (  186.41 ms per token,     5.36 tokens per second)
total time =  115065.99 ms /  5675 tokens

Without the flag:

prompt eval time =  268849.41 ms / 75130 tokens (    3.58 ms per token,   279.45 tokens per second)
eval time =    6960.09 ms /    39 tokens (  178.46 ms per token,     5.60 tokens per second)
total time =  275809.51 ms / 75169 tokens

Successfully merging this pull request may close these issues:

Feature Request: Allow disabling offload_op for backends by user
